Autoregressive processes naturally arise in a large variety of real-world scenarios, including stock markets, sales forecasting, weather prediction, advertising, and pricing. When addressing a sequential decision-making problem in such a context, the temporal dependence between consecutive observations must be properly accounted for in order to converge to the optimal decision policy. In this work, we propose a novel online learning setting, named Autoregressive Bandits (ARBs), in which the observed reward follows an autoregressive process of order $k$ whose parameters depend on the action the agent chooses from a finite set of $n$ actions. We then devise an optimistic regret minimization algorithm, AutoRegressive Upper Confidence Bounds (AR-UCB), that suffers regret of order $\widetilde{\mathcal{O}} \left( \frac{(k+1)^{3/2}\sqrt{nT}}{(1-\Gamma)^2} \right)$, where $T$ is the optimization horizon and $\Gamma < 1$ is an index of the stability of the system. Finally, we present a numerical validation on several synthetic settings and one real-world setting, comparing against general- and specific-purpose bandit baselines and showing the advantages of the proposed approach.
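To make the reward model above concrete, the following is a minimal sketch of an action-dependent autoregressive reward process of order $k$. It is not the AR-UCB algorithm. The function name `simulate_arb`, the coefficient layout `gammas[a] = [g0, g1, ..., gk]`, the Gaussian noise, and the zero initial history are all assumptions made for illustration.

```python
import random


def simulate_arb(gammas, actions, noise_std=0.1, seed=0):
    """Simulate x_t = g0[a_t] + sum_i g_i[a_t] * x_{t-i} + noise.

    gammas[a] is a hypothetical list [g0, g1, ..., gk] for action a.
    The history starts at zero, which is an assumption of this sketch.
    """
    rng = random.Random(seed)
    k = len(gammas[0]) - 1
    history = [0.0] * k  # most recent reward first
    rewards = []
    for a in actions:
        g = gammas[a]
        # Autoregressive part: weighted sum of the last k rewards.
        x = g[0] + sum(gi * xi for gi, xi in zip(g[1:], history))
        x += rng.gauss(0.0, noise_std)
        rewards.append(x)
        history = ([x] + history)[:k]
    return rewards


# Two actions, order k = 1; both keep the AR coefficient below 1,
# which is used here as an illustrative stability condition.
gammas = [[1.0, 0.5], [0.2, 0.9]]
rewards = simulate_arb(gammas, actions=[0] * 50, noise_std=0.0)
print(rewards[-1])
```

Because each reward feeds into the next one, the value of an action depends on the sequence played so far, which is why the abstract stresses the temporal dependence between observations.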
translated by Google Translate
Behavioral Cloning (BC) aims at learning a policy that mimics the behavior demonstrated by an expert. The current theoretical understanding of BC is limited to the case of finite actions. In this paper, we study BC with the goal of providing theoretical guarantees on the performance of the imitator policy in the case of continuous actions. We start by deriving a novel bound on the performance gap based on the Wasserstein distance, applicable to continuous-action experts and holding under the assumption that the value function is Lipschitz continuous. Since this latter condition is hardly fulfilled in practice, even for Lipschitz Markov Decision Processes and policies, we propose a relaxed setting, proving that the value function is always Hölder continuous. This result is of independent interest and allows obtaining a general bound on the performance of the imitator policy in BC. Finally, we analyze noise injection, a common practice in which the expert action is executed in the environment after the application of a noise kernel. We show that this practice allows deriving stronger performance guarantees, at the price of a bias due to the noise addition.
This paper is in the field of stochastic Multi-Armed Bandits (MABs), i.e., those sequential selection techniques able to learn online using only the feedback given by the chosen option (a.k.a. arm). We study a particular case of the rested and restless bandits in which the arms' expected payoff is monotonically non-decreasing. This characteristic allows designing specifically crafted algorithms that exploit the regularity of the payoffs to provide tight regret bounds. We design an algorithm for the rested case (R-ed-UCB) and one for the restless case (R-less-UCB), providing a regret bound depending on the properties of the instance and, under certain circumstances, of $\widetilde{\mathcal{O}}(T^{\frac{2}{3}})$. We empirically compare our algorithms with state-of-the-art methods for non-stationary MABs over several synthetically generated tasks and an online model selection problem for a real-world dataset. Finally, using synthetic and real-world data, we illustrate the effectiveness of the proposed approaches compared with state-of-the-art algorithms for the non-stationary bandits.
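With the continuous growth of the global economy and markets, resource imbalance has become one of the central issues in real logistics scenarios. In marine transportation, this trade imbalance leads to the Empty Container Repositioning (ECR) problem. Once freight has been delivered from an exporting country to an importing one, the laden containers become empty containers that need to be repositioned to satisfy new goods requests in exporting countries. In such problems, the performance that any cooperative repositioning policy can achieve depends strictly on the routes that the vessels will follow (i.e., the fleet deployment). Historically, Operations Research (OR) approaches have been proposed to optimize the repositioning policy jointly with the fleet of vessels. However, the stochasticity of the future supply and demand of containers, together with the black-box and nonlinear constraints present in the environment, makes these approaches unsuitable for such scenarios. In this paper, we introduce a novel framework, Configurable Semi-POMDPs, to model this type of problem. Furthermore, we provide a two-stage learning algorithm, "Configure & Conquer" (CC), which first configures the environment by finding an approximation of the optimal fleet deployment strategy, and then "conquers" it by learning an ECR policy in this tuned environmental setting. We validate our approach on large, real-world instances of the problem. Our experiments highlight that CC avoids the pitfalls of OR methods and successfully optimizes both the ECR policy and the fleet of vessels, leading to superior performance in world trade environments.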
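Warehouse Management Systems have been continuously evolving and improving thanks to new data intelligence techniques. However, many current optimizations have been applied to specific cases or rely heavily on manual interaction. This is where Reinforcement Learning techniques come into play, providing automation and the ability to adapt current optimization policies. In this paper, we present a customizable environment that generalizes the definition of warehouse simulations for Reinforcement Learning. We also validate this environment against state-of-the-art reinforcement learning algorithms and compare these results with human and random policies.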
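Learning in a lifelong setting, where the dynamics continually evolve, is a hard challenge for current reinforcement learning algorithms. Yet it would be a much-needed feature for practical applications. In this paper, we propose an approach that learns a hyper-policy whose input is time and whose output is the parameters of the policy to be queried at that time. This hyper-policy is trained to maximize the estimated future performance, efficiently reusing past data at the cost of introducing a controlled bias. We combine the future performance estimate with the past performance to mitigate catastrophic forgetting. To avoid overfitting the collected data, we derive a differentiable variance term that we embed as a penalization term. Finally, we empirically validate our approach, in comparison with state-of-the-art algorithms, on realistic environments, including water resource management and trading.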
The number of international benchmarking competitions is steadily increasing in various fields of machine learning (ML) research and practice. So far, however, little is known about the common practice as well as bottlenecks faced by the community in tackling the research questions posed. To shed light on the status quo of algorithm development in the specific field of biomedical imaging analysis, we designed an international survey that was issued to all participants of challenges conducted in conjunction with the IEEE ISBI 2021 and MICCAI 2021 conferences (80 competitions in total). The survey covered participants' expertise and working environments, their chosen strategies, as well as algorithm characteristics. A median of 72% of challenge participants took part in the survey. According to our results, knowledge exchange was the primary incentive (70%) for participation, while the reception of prize money played only a minor role (16%). While a median of 80 working hours was spent on method development, a large portion of participants stated that they did not have enough time for method development (32%). 25% perceived the infrastructure to be a bottleneck. Overall, 94% of all solutions were deep learning-based. Of these, 84% were based on standard architectures. 43% of the respondents reported that the data samples (e.g., images) were too large to be processed at once. This was most commonly addressed by patch-based training (69%), downsampling (37%), and solving 3D analysis tasks as a series of 2D tasks. K-fold cross-validation on the training set was performed by only 37% of the participants, and only 50% of the participants performed ensembling, based on either multiple identical models (61%) or heterogeneous models (39%). 48% of the respondents applied postprocessing steps.
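In today's era of smart cyber-physical systems, Deep Neural Networks (DNNs) have become ubiquitous due to their state-of-the-art performance in complex real-world applications. The high computational complexity of these networks, which translates into increased energy consumption, is the foremost obstacle to deploying large DNNs in resource-constrained systems. Fixed-Point (FP) implementations achieved through post-training quantization are commonly used to reduce the energy consumption of these networks. However, the uniform quantization intervals in FP restrict the bit-width of data structures to large values, because most of the numbers must be represented with sufficient resolution to avoid high quantization errors. In this paper, we leverage the key insight that (in most scenarios) DNN weights and activations are mostly concentrated near zero and only a few of them have large magnitudes. We propose CoNLoCNN, a framework to enable energy-efficient low-precision deep convolutional neural network inference by exploiting: (1) non-uniform quantization of weights, which enables the simplification of complex multiplication operations; and (2) the correlation between activation values, which enables partial compensation of quantization errors at low cost without any run-time overhead. To benefit significantly from non-uniform quantization, we also propose a novel data representation format, Encoded Low-Precision Binary Signed Digit, to compress the bit-width of weights while ensuring that the encoded weights can be used directly for processing with a novel multiply-and-accumulate (MAC) unit design.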
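As an illustration of why non-uniform quantization suits values concentrated near zero, here is a minimal sketch of a generic power-of-two quantizer. It is not the encoding proposed in the abstract above. The function name `quantize_pow2` and the exponent range `min_exp`/`max_exp` are hypothetical choices for this example.

```python
import math


def quantize_pow2(w, min_exp=-6, max_exp=0):
    """Round w to the nearest signed power of two (in the log domain).

    Levels are dense near zero and sparse for large magnitudes.
    Values too small for the exponent range are flushed to zero.
    """
    if w == 0.0:
        return 0.0
    exp = round(math.log2(abs(w)))
    if exp < min_exp:
        return 0.0  # below the smallest representable level
    exp = min(max_exp, exp)  # saturate large magnitudes
    return math.copysign(2.0 ** exp, w)


weights = [0.3, -0.7, 0.02, 0.001, 5.0]
print([quantize_pow2(w) for w in weights])
```

With weights restricted to powers of two, multiplying a fixed-point activation by a weight reduces to a bit shift, which is one common way that non-uniform quantization simplifies multiplication hardware.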
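In this paper, we present Tac2Pose, an object-specific approach to tactile pose estimation from the first touch for known objects. Given the object geometry, we learn a tailored perception model in simulation that estimates a probability distribution over possible object poses given a tactile observation. To do so, we simulate the contact shapes that a dense set of object poses would produce on the sensor. Then, given a new contact shape obtained from the sensor, we match it against the pre-computed set using an object-specific embedding learned with contrastive learning. We obtain contact shapes from the sensor with an object-agnostic calibration step that maps RGB tactile observations to binary contact shapes. This mapping, which can be reused across object and sensor instances, is the only step trained with real sensor data. The result is a perception model that localizes objects from the first real tactile observation. Importantly, it produces pose distributions and can incorporate additional pose constraints coming from other perception systems, contacts, or priors. We provide quantitative results for 20 objects. Tac2Pose provides high-accuracy pose estimates from distinctive tactile observations while regressing meaningful pose distributions to account for contact shapes that could result from different object poses. We also test Tac2Pose on object models reconstructed from a 3D scanner to evaluate its robustness to uncertainty in the object model. Finally, we demonstrate the advantages of Tac2Pose compared with three baseline methods for tactile pose estimation: directly regressing the object pose with a neural network, matching an observed contact to a set of possible contacts using a standard classification neural network, and direct pixel comparison of an observed contact with a set of possible contacts. Website: http://mcube.mit.edu/research/tac2pose.html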
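The matching step described above can be pictured with a small sketch: an observed embedding is compared with a pre-computed set of per-pose embeddings and the similarities are turned into a distribution over poses. This is only an assumption-laden illustration, not the authors' implementation. The function name `pose_distribution`, the use of cosine similarity, the softmax, and the `temperature` value are all hypothetical.

```python
import math


def pose_distribution(query, pose_embeddings, temperature=0.1):
    """Return a probability per candidate pose from embedding similarity.

    Uses cosine similarity followed by a softmax; both are assumptions
    of this sketch rather than details taken from the abstract.
    """
    def normalize(v):
        norm = math.sqrt(sum(x * x for x in v)) or 1.0
        return [x / norm for x in v]

    q = normalize(query)
    sims = [sum(a * b for a, b in zip(q, normalize(e)))
            for e in pose_embeddings]
    top = max(sims)  # subtract the max for numerical stability
    weights = [math.exp((s - top) / temperature) for s in sims]
    total = sum(weights)
    return [w / total for w in weights]


# Three hypothetical pre-computed pose embeddings and one observation.
candidates = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
probs = pose_distribution([0.9, 0.1], candidates)
best_pose = max(range(len(probs)), key=probs.__getitem__)
print(best_pose, probs)
```

Returning the full distribution rather than only the best match keeps ambiguous contacts visible, so that extra constraints from other sensors or priors could be multiplied in before choosing a pose.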
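Machine learning algorithms underpin modern diagnostic-aiding software, which has proved valuable in clinical practice, particularly in radiology. However, inaccuracies, mainly due to the limited availability of clinical samples for training these algorithms, hamper their wider applicability, acceptance, and recognition among clinicians. We present an analysis of state-of-the-art automatic quality control (QC) approaches that can be implemented within these algorithms to estimate the certainty of their outputs. We validated the most promising approaches on a brain image segmentation task: identifying white matter hyperintensities (WMH) in magnetic resonance imaging data. WMH are a correlate of small vessel disease common in mid-to-late adulthood and are particularly challenging to segment because of their varied size and distribution patterns. Our results show that the aggregation of uncertainty and Dice prediction were the most effective in failure detection for this task. Both methods independently improved the mean Dice from 0.82 to 0.84. Our work reveals how QC methods can help detect failed segmentation cases and thereby make automatic segmentation more reliable and suitable for clinical practice.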
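For readers unfamiliar with the Dice score reported above, the following minimal sketch computes it for two flattened binary masks. The function name `dice_score` and the convention of returning 1.0 when both masks are empty are assumptions of this example, not details from the abstract.

```python
def dice_score(pred, target):
    """Dice = 2 * |A intersect B| / (|A| + |B|) for binary masks.

    pred and target are equal-length sequences of 0/1 values.
    Two empty masks are scored as 1.0 by convention in this sketch.
    """
    if len(pred) != len(target):
        raise ValueError("masks must have the same length")
    pred = [bool(p) for p in pred]
    target = [bool(t) for t in target]
    intersection = sum(p and t for p, t in zip(pred, target))
    size = sum(pred) + sum(target)
    return 1.0 if size == 0 else 2.0 * intersection / size


predicted = [1, 1, 0, 0]
reference = [1, 0, 0, 0]
print(dice_score(predicted, reference))
```

A quality control step could, for instance, flag cases whose predicted Dice falls below a chosen threshold for manual review; the threshold itself would need to be tuned for the task.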